PDF2XML: Converting PDF to XML

Authors

  • Yonggao Yang
  • Kwang H. Paick
  • Yanxiong Peng
  • Yukong Zhang
Abstract

XML is a markup language for documents containing structured information. It is designed to make it easy to interchange structured documents over the Internet and to integrate them with database management systems. PDF is a document format intended to reproduce the look of a page electronically. There is strong demand for converting existing PDF documents into XML so that they become searchable and manageable. Because PDF is essentially a page layout format and does not carry the original document structure, converting PDF to XML remains a challenging task. This paper addresses the related technical problems and explores approaches to them. As part of the Data Conversion Project under development at the Data Conversion Center funded by DoD, we present a system, PDF2XML, designed to perform the conversion automatically with minimal human interaction.

Similar resources

A System for Converting PDF Documents into Structured XML Format

We present in this paper a system for converting PDF legacy documents into structured XML format. This conversion system first extracts the different streams contained in PDF files (text, bitmap and vectorial images) and then applies different components in order to express in XML the logically structured documents. Some of these components are traditional in Document Analysis, other more speci...

Prototyping a Vibrato-Aware Query-By-Humming (QBH) Music Information Retrieval System for Mobile Communication Devices: Case of Chromatic Harmonica

Background and Aim: The current research aims at prototyping query-by-humming music information retrieval systems for smart phones. Methods: This multi-method research follows simulation technique from mixed models of the operations research methodology, and the documentary research method, simultaneously. Two chromatic harmonica albums comprised the research population. To achieve the purpose ...

Creating Structured PDF Files

This paper describes a tool for recombining the logical structure from an XML document with the typeset appearance of the corresponding PDF document. The tool uses the XML representation as a template for the insertion of the logical structure into the existing PDF document, thereby creating a Structured/Tagged PDF. The addition of logical structure adds value to the PDF in three ways: the acce...

Tralics, a LaTeX to XML Translator

In this paper we describe Tralics, a LaTeX to XML translator. A previous version of the software (written in Perl) was used to obtain the pdf version of Inria’s “Rapport d’Activité” for year 2001. The current version of the software (written in C++) is used for both the HTML and pdf version for the year 2002. The XML generated by Tralics conforms to a local DTD, similar to the TEI; it was conver...

Converting Relational Database into XML Document

XML (Extensible Markup Language) is emerging and gradually being accepted as the standard for data interchange in the Internet world. Interoperation of relational databases and XML databases involves schema and data translations. The schema of a relational database can be converted into XML through the EER (extended entity relationship) model. The semantics of the relational database, captured in EER diagram, are...

Publication date: 2004